Explore no more: Improved high-probability regret bounds for non-stochastic bandits
Author
Gergely Neu
Abstract
This work addresses the problem of regret minimization in non-stochastic multi-armed bandit problems, focusing on performance guarantees that hold with high probability. Such results are rather scarce in the literature since proving them requires a great deal of technical effort and significant modifications to the standard, more intuitive algorithms that come only with guarantees that hold in expectation. One of these modifications is forcing the learner to sample the losses of every arm at least Ω(√T) times over T rounds, which can adversely affect performance if many of the arms are obviously suboptimal. While it is widely conjectured that this property is essential for proving high-probability regret bounds, we show in this paper that it is possible to achieve such strong results without this undesirable exploration component. Our result relies on a simple and intuitive loss-estimation strategy called Implicit eXploration (IX) that allows a remarkably clean analysis. To demonstrate the flexibility of our technique, we derive several improved high-probability bounds for various extensions of the standard multi-armed bandit framework. Finally, we conduct a simple experiment that illustrates the robustness of our implicit exploration technique.
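Concretely, IX replaces the usual importance-weighted loss estimate ℓ_{t,i}·𝟙{I_t = i}/p_{t,i} with ℓ_{t,i}·𝟙{I_t = i}/(p_{t,i} + γ), where p_{t,i} is the probability of playing arm i in round t and γ > 0 is the implicit-exploration parameter. Below is a minimal Python sketch of an exponential-weights learner using this estimate; the fixed loss-matrix interface and the constant parameters η and γ are illustrative simplifications, not the paper's exact pseudocode.

    import numpy as np

    def exp3_ix(losses, eta, gamma, rng=None):
        # losses: (T, K) array of per-round losses in [0, 1] (illustrative interface)
        rng = np.random.default_rng() if rng is None else rng
        T, K = losses.shape
        cum_est = np.zeros(K)      # cumulative IX loss estimates
        total = 0.0
        for t in range(T):
            # exponential-weights distribution; note: no forced uniform exploration
            w = np.exp(-eta * (cum_est - cum_est.min()))
            p = w / w.sum()
            arm = rng.choice(K, p=p)
            loss = losses[t, arm]
            total += loss
            # Implicit eXploration: divide by p + gamma instead of p, which biases
            # the estimate slightly downward but keeps its variance under control
            cum_est[arm] += loss / (p[arm] + gamma)
        return total

In the paper's analysis the two parameters are coupled, with γ set to η/2.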
Similar papers
An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits
We present an algorithm that achieves almost optimal pseudo-regret bounds against adversarial and stochastic bandits. Against adversarial bandits the pseudo-regret is O(K√(n log n)) and against stochastic bandits the pseudo-regret is O(∑_i (log n)/Δ_i). We also show that no algorithm with O(log n) pseudo-regret against stochastic bandits can achieve Õ(√n) expected regret against adaptive...
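For clarity, the pseudo-regret over n rounds is, in the standard loss notation (with I_t the arm played in round t),

    R̄_n = max_i E[ ∑_{t=1}^{n} (ℓ_{I_t,t} − ℓ_{i,t}) ],

so the expectation is taken before maximizing over comparator arms; the expected regret takes the maximum inside the expectation, which is exactly the distinction the last claim above relies on. In the stochastic bound, Δ_i denotes the suboptimality gap of arm i.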
Improved Algorithms for Linear Stochastic Bandits
We improve the theoretical analysis and empirical performance of algorithms for the stochastic multi-armed bandit problem and the linear stochastic multi-armed bandit problem. In particular, we show that a simple modification of Auer's UCB algorithm (Auer, 2002) achieves, with high probability, constant regret. More importantly, we modify and, consequently, improve the analysis of the algorithm f...
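For context, Auer's UCB algorithm referenced above plays in each round the arm maximizing an optimistic index, the empirical mean plus a confidence bonus √(2 ln t / n_i). The following is a generic Python sketch of that index computation, not the modified variant analyzed in the cited paper; the function name and the pull(i) reward interface are illustrative.

    import math

    def ucb1(pull, K, T):
        # pull(i): returns a stochastic reward in [0, 1] for arm i (illustrative)
        means = [pull(i) for i in range(K)]   # play every arm once to initialize
        counts = [1] * K
        for t in range(K + 1, T + 1):
            # optimistic index: empirical mean + sqrt(2 ln t / n_i)
            i = max(range(K),
                    key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
            r = pull(i)
            counts[i] += 1
            means[i] += (r - means[i]) / counts[i]  # incremental mean update
        return means, counts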
Unimodal Bandits without Smoothness
We consider stochastic bandit problems with a continuum set of arms and where the expected reward is a continuous and unimodal function of the arm. No further assumption is made regarding the smoothness and the structure of the expected reward function. We propose Stochastic Pentachotomy (SP), an algorithm for which we derive finite-time regret upper bounds. In particular, we sho...
Tight Regret Bounds for Stochastic Combinatorial Semi-Bandits A Proofs of Main Theorems
A Proofs of Main Theorems
A.1 Proof of Lemma 1
Let R_t = R(A_t, w_t) be the stochastic regret of CombUCB1 at time t, where A_t and w_t are the solution and the weights of the items at time t, respectively. Furthermore, let E_t = {∃ e ∈ E : |w̄(e) − ŵ_{T_{t−1}(e)}(e)| ≥ c_{t−1, T_{t−1}(e)}} be the event that w̄(e) is outside of the high-probability confidence interval around ŵ_{T_{t−1}(e)}(e) for some item e at time t; and let E_t ...
Conservative Bandits
We study a novel multi-armed bandit problem that models the challenge faced by a company wishing to explore new strategies to maximize revenue whilst simultaneously maintaining its revenue above a fixed baseline, uniformly over time. While previous work addressed the problem under the weaker requirement of maintaining the revenue constraint only at a given fixed time in the future, the algori...
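A common way to make the baseline requirement precise, up to details of the cited formulation, is to demand that for every round t the learner's cumulative expected revenue stays above a (1 − α) fraction of what the baseline strategy would have earned:

    ∑_{s=1}^{t} μ_{I_s} ≥ (1 − α) · t · μ_0   for all t,

where μ_0 is the baseline's per-round expected revenue, μ_{I_s} the expected revenue of the arm played in round s, and α ∈ (0, 1) the tolerated shortfall; these symbols are generic notation introduced here for illustration.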